String Processing Algorithms

Author

  • Shunsuke Inenaga
Abstract

The thesis describes extensive studies on various algorithms for efficient string processing. Data available in or via computers are often of enormous size, so it is important and necessary to invent time- and space-efficient methods to process them. Most such data are, in fact, stored and manipulated as strings. String matching is the most fundamental problem in string processing: the task is to decide whether or not a pattern string p occurs in a text string w. There are two cases to consider: p is fixed and w is flexible, and vice versa. In the former case, it is adequate to employ the well-known algorithm by Knuth, Morris, and Pratt, which solves the problem in O(|w|) time using O(|p|) space. The thesis, on the other hand, considers the latter case. When w is fixed, it is natural, and ideal, to use a data structure that supports indices of w. Such a data structure is called an index structure. A linear-space index structure, the suffix tree, was first given by Weiner in 1973. Suffix trees are regarded as a compaction of suffix tries, which are a basic index structure requiring quadratic space. On the other hand, minimizing suffix tries yields another type of index structure called directed acyclic word graphs (DAWGs), which were introduced by Blumer et al. in 1985. Moreover, either minimizing suffix trees or compacting DAWGs yields compact directed acyclic word graphs (CDAWGs). CDAWGs were also invented by Blumer et al., in 1987. In the thesis we delve into those index structures, revealing their relationships in terms of equivalence classes on strings. After giving these theoretical characterizations, we explore algorithms related to those index structures for time- and space-efficient string processing in practice.
In particular, we first introduce an on-line algorithm that directly constructs a CDAWG for a single string w in O(|w|) time, and second, give its straightforward extension to a set S of strings whose running time is O(‖S‖), where ‖S‖ denotes the total length of the strings in S. A further, deeper analysis of the on-line algorithm gives us a generalized ...
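
As a rough illustration of the two settings the abstract contrasts, the following Python sketch shows (1) a Knuth-Morris-Pratt matcher for a fixed pattern p, and (2) a naive suffix trie used as an index for a fixed text w. This is not the thesis's code: the function names (`kmp_search`, `build_suffix_trie`, `trie_contains`, `count_nodes`) and the nested-dict representation are assumptions made for this example, and the trie is built by brute force rather than by any of the linear-time constructions the thesis studies.

```python
def kmp_search(p: str, w: str) -> bool:
    """Return True if pattern p occurs in text w (Knuth-Morris-Pratt)."""
    if not p:
        return True  # the empty pattern occurs in every text
    # failure[i] = length of the longest proper border of p[:i + 1]
    failure = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = failure[k - 1]
        if p[i] == p[k]:
            k += 1
        failure[i] = k
    # Scan w once; k is the number of pattern characters currently matched.
    k = 0
    for c in w:
        while k > 0 and c != p[k]:
            k = failure[k - 1]
        if c == p[k]:
            k += 1
        if k == len(p):
            return True
    return False


def build_suffix_trie(w: str) -> dict:
    """Naive suffix trie of w as nested dicts (illustrative, not linear-time)."""
    root: dict = {}
    for i in range(len(w)):
        node = root
        for c in w[i:]:
            node = node.setdefault(c, {})
    return root


def trie_contains(trie: dict, p: str) -> bool:
    """Check whether p is a substring of the indexed text in O(|p|) steps."""
    node = trie
    for c in p:
        if c not in node:
            return False
        node = node[c]
    return True


def count_nodes(trie: dict) -> int:
    """Count trie nodes, including the root."""
    return 1 + sum(count_nodes(child) for child in trie.values())
```

For a text whose characters are all distinct, every suffix contributes its full length in new nodes, which is where the quadratic space of suffix tries comes from; suffix trees, DAWGs, and CDAWGs avoid this by compaction and/or minimization.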

Similar resources

Electromagnetism-like Algorithms for The Fuzzy Fixed Charge Transportation Problem

In this paper, we consider the fuzzy fixed-charge transportation problem (FFCTP). Both the fixed and the transportation costs are fuzzy numbers. In contrast to previous work, Electromagnetism-like Algorithms (EM) are proposed for the first time in this research area to solve the problem. Three types of EM (original EM, revised EM, and hybrid EM) are employed for the given problem for the first time. The latter is being firstl...

Parallel String Matching Algorithms

String-matching cannot be done by a two-head one-way deterministic finite automaton, Information Processing Letters 22, 231-235.

String comparison by transposition networks

Computing string or sequence alignments is a classical method of comparing strings and has applications in many areas of computing, such as signal processing and bioinformatics. Semi-local string alignment is a recent generalisation of this method, in which the alignment of a given string and all substrings of another string are computed simultaneously at no additional asymptotic cost. In this ...

State of the Art for String Analysis and Pattern Search Using CPU and GPU Based Programming

String matching algorithms are an important component of network intrusion detection systems. In these systems, the string matching algorithms account for more than half of the CPU processing time. In recent years, GPU technology has been shown to outperform the CPU on these types of applications. In this article we perform a review of the state of the art of the different string ...

To Use or Not to Use: Graphics Processing Units for Pattern Matching Algorithms

String matching is an important part of today’s computer applications, and the Aho-Corasick algorithm is one of the main string matching algorithms used to accomplish it. This paper discusses when GPUs can be used for string matching applications, using the Aho-Corasick algorithm as a benchmark. We have to identify the best unit to run our string matching algorithm according to the perform...

Parallel String Matching with Multi Core Processors-A Comparative Study for Gene Sequences

The amount of data is clearly growing at present because of the need to store more information. Extracting specific data from such a large database is a very difficult task in areas including text processing, information retrieval, text mining, pattern recognition and DNA sequencing. So we need concurrent events and high performance computing models for extracting the data. This w...

Publication date: 2003